Introduction

Coronavirus disease(COVID19) is an infectious disease caused by a newly discovered coronavirus. It has spread to numerous countries across all continents since it was first discovery in Wuhan, China back in Nov 2019 and was declared as pandemic by WHO on March 11 2020.

Various countries has came out measure/restrictions to respond to COVID-19. Since “circuit breaker”, a partial nationwide lockdown, where only essential services were allowed to open, Singapore(SG) residents have started to feel a great impact on daily life where they are encouraged to stay home as much as possible and wearing of mask became mandatory when going out. SG government has constantly revising policies and social restrictions. Three phases of planned reopening were announced since 19 May namely “Safe Reopening” (Phase1) “Safer Transition” (Phase2), and finally “Safe Nation” (Phase3).

Problem Statement

Microblogging has become one of the most useful tools for sharing everyday life events and news and for expressing opinions about these events. As Twitter posts are short and constantly being generated, they are a great source for providing public sentiment towards events that occurred throughout the COVID-19 period in Singapore.

In our Capstone Project, we perform exploratory data analysis about SG COVID situation and sentiment analysis and modeling on the tweets about COVID19 to seek to answer the following research questions:

  1. What are the main prevalent sentiment and emotions expressed in words in Singapore tweets about current COVID situation?

  2. Is there any change of sentiment over a period of time amidst global reopening with higher vaccination rate, in contrast to growing new daily cases/death locally?

For our data science project, we activated the following packages, using the Tidyverse approach.

# Load necessary packages
pacman::p_load(tidyverse, broom, modelr, lubridate, 
               tidytext, wordcloud2, wordcloud, reshape2,
               textdata, huxtable,  # Employing Lexicon
               gridExtra, psych, Hmisc, sandwich, 
               scales, rtweet, glue, ggplot2, 
               caret, DT, dplyr)

my_colors <- c("#05A4C0", "#85CEDA", "#D2A7D8", "#A67BC5", "#BB1C8B", "#8D266E", "gold4", "darkred", "deepskyblue4")

my_theme <- theme(plot.background = element_rect(fill = "grey98", color = "grey20"),
                  panel.background = element_rect(fill = "grey98"),
                  panel.grid.major = element_line(colour = "grey87"),
                  text = element_text(color = "grey20"),
                  plot.title = element_text(size = 22),
                  plot.subtitle = element_text(size = 17),
                  axis.title = element_text(size = 15),
                  axis.text = element_text(size = 15),
                  legend.box.background = element_rect(color = "grey20", fill = "grey98", size = 0.1),
                  legend.box.margin = margin(t = 3, r = 3, b = 3, l = 3),
                  legend.title = element_blank(),
                  legend.text = element_text(size = 15),
                  strip.text = element_text(size=17))

Import

Then, we imported our dataset.

Data Source 1: SG COVID DATA

SSA <- readRDS("covid19.rds")

The dataset contains time series data on covid-19 cases in Singapore on counts of confirmed, discharged, hospitalized, deaths, imported cases.

Within the dataset, Some of few key fields we are interested in reflecting current Singapore COVID-19 situation, namely Daily.Confirmed, Daily Deaths, Still.Hospitalised and Intensive.Care.Unit..ICU which will also serve as our dependent (discrete x) variable.

Notes:

  • All figures (excluding variables with names containing “MOH report”) are as at press release within the day and are not back-dated to update any changes that the Ministry of Health (MOH) might have made.

  • “Daily imported” and “Daily Local transmission” make up “Daily Confirmed”.

  • Still Hospitalised" is computed based on “Total Confirmed” - “Cumulative Discharged” - “Discharged to Isolation” - “Cumulative Deaths” - “Tested positive demise” (summed). This might not tally with the sum of “General Wards MOH report” and “Intensive Care Unit (ICU)”, indicating dirty data.

  • “Cumulative Vaccine Doses”, “Cumulative Individuals Vaccinated”, and “Cumulative Individuals Vaccination Completed” added from 1 Jul 2021. The data is tied to the date of the report but the count is as of the previous day. i.e. Figures indicated for 1 Jul 2021 reflect the total number of doses of COVID-19 vaccines as of Jun 30. “Cumulative Individuals Vaccinated” refers to the number of individuals who have received at least one dose of vaccine.

glimpse(SSA)
## Rows: 999
## Columns: 37
## $ Date                                                          <chr> "2020-01…
## $ Daily.Confirmed                                               <int> 1, 2, 1,…
## $ False.Positives.Found                                         <int> NA, NA, …
## $ Cumulative.Confirmed                                          <int> 1, 3, 4,…
## $ Daily.Discharged                                              <int> 0, 0, 0,…
## $ Passed.but.not.due.to.COVID                                   <int> 0, 0, 0,…
## $ Cumulative.Discharged                                         <int> 0, 0, 0,…
## $ Discharged.to.Isolation                                       <int> 0, 0, 0,…
## $ Still.Hospitalised                                            <int> 1, 3, 4,…
## $ Daily.Deaths                                                  <int> 0, 0, 0,…
## $ Cumulative.Deaths                                             <int> 0, 0, 0,…
## $ Tested.positive.demise                                        <int> 0, 0, 0,…
## $ Daily.Imported                                                <int> 1, 2, 1,…
## $ Daily.Local.transmission                                      <int> 0, 0, 0,…
## $ Local.cases.residing.in.dorms.MOH.report                      <int> NA, NA, …
## $ Local.cases.not.residing.in.doms.MOH.report                   <int> NA, NA, …
## $ Intensive.Care.Unit..ICU.                                     <int> 0, 0, 0,…
## $ General.Wards.MOH.report                                      <int> NA, NA, …
## $ In.Isolation.MOH.report                                       <int> NA, NA, …
## $ Total.Completed.Isolation.MOH.report                          <int> NA, NA, …
## $ Total.Hospital.Discharged.MOH.report                          <int> NA, NA, …
## $ Requires.Oxygen.Supplementation.or.Unstable                   <int> NA, NA, …
## $ Linked.community.cases                                        <int> NA, NA, …
## $ Unlinked.community.cases                                      <int> NA, NA, …
## $ Phase                                                         <chr> "", "", …
## $ Cumulative.Vaccine.Doses                                      <int> NA, NA, …
## $ Cumulative.Individuals.Vaccinated                             <int> NA, NA, …
## $ Cumulative.Individuals.Vaccination.Completed                  <int> NA, NA, …
## $ Perc.population.completed.at.least.one.dose                   <chr> "", "", …
## $ Perc.population.completed.vaccination                         <chr> "", "", …
## $ Sinovac.vaccine.doses                                         <int> NA, NA, …
## $ Cumulative.individuals.using.Sinovac.vaccine                  <int> NA, NA, …
## $ Doses.of.other.vaccines.recognised.by.WHO                     <int> NA, NA, …
## $ Cumulative.individuals.using.other.vaccines.recognised.by.WHO <int> NA, NA, …
## $ Number.taken.booster.shots                                    <int> NA, NA, …
## $ Perc.population.taken.booster.shots                           <chr> "", "", …
## $ X                                                             <lgl> NA, NA, …

Data Source 1: Tidy & Transform

The first thing we did with our loaded dataset was to remove the non-relevant columns and remain only those we will perform analysis and modeling. We also transformed the dataset into long format for data exploratory visualization.

SSA1<- tibble(SSA)

SSA1 <- SSA1[-(1:626) , -(18:37)]
SSA1 <- SSA1[ , -(11:16)]  
SSA1 <- SSA1[ , -(3:8)]
SSA1 <- SSA1[-(35:373) , ]

SSA1$Date <- as.Date(SSA1$Date)

#This illustrate a comparison of the daily cases for Death, Confirmed, Hospitalised and ICU over the study period

SSA_chart <- SSA1 %>% pivot_longer(cols = Daily.Confirmed:Intensive.Care.Unit..ICU. , 
                                    names_to = "Cases", 
                                    values_to = "Value")


COLORS <- c(Daily.Confirmed = "#c381fd", Daily.Death ="#4815aa",  
            Still.Hospitalised = "#f2626b" , Intensive.Care.Unit..ICU. = "#feba4f")

ggplot(SSA_chart, aes(x = Date, y = Value, group = Cases, color = Cases)) +
  geom_line(size = 0.9) +
  scale_color_manual(values = COLORS)+
  scale_y_continuous("Cases", limits = c(0,5500)) + 
  labs(title="Comparison of Daily Cases\nfor Death, Confirmed, Hospitalised and ICU")+
  theme(legend.title = element_text(color = "blue", size = 10)) +
  my_theme + theme(axis.title.x = element_blank(),
                   legend.position = "bottom") +
  scale_x_date(date_breaks = "1 day") + 
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 )) 

Data Source 2: SG TWEETER DATA

# We observed 7-days data usually capped below 3000 tweets per extraction.
# sg_tweets <- search_tweets(q = "#coronavirus OR #covid19 OR #COVID OR #stayhome OR #Covid-19 OR #pandemic OR #virus OR #social-distance OR #self-quarantine OR $swab-test OR #PCR OR #infection-rate", 
#                                         n = 3000, 
#                                         lang = "en",
#                                         include_rts = F,
#                                         geocode = lookup_coords("singapore")
                                
sg_tweets <- readRDS("tweeter_data.rds")

Let’s explore our tweets data in regards to COVID-19 from our first extraction on 18th October to understand sentiment after recent sharp rise in number of local cases and death since end-September.

We also identified 2 key events over the period to analyse further to answer our research question if the event will have significance influence on the public sentiment.

2021-10-20

  • PM Lee’s address on COVID-19 situation
  • Announcement on the extension of the Stabilisation Phase for four weeks, through to 21 November 2021.
  • Unvaccinated people can no longer eat at hawker centres and enter shopping malls.

2021-11-08

  • Allow up to five fully vaccinated persons from the same household to dine-in together at food and beverage (F&B) establishments
  • Loose restrictions on sports and selected MICE (Meetings, Incentives, Conferences and Exhibitions) events.
  • Resuming more activities in schools, in preparation for the larger-scale safe resumption of co-curricular learning activities in the coming school year.
  • Adjusting border measures and extending Vaccinated Travel Lanes(VTL) to Malaysia, Finland and Sweden.

It is worth noting that whenever there is a major announcement by the government, there will be jump on the number of tweets.

# Basic EDA on amount of tweet in time (ALL)
options(repr.plot.width=20, repr.plot.height=9)

sg_tweets %>% 
  select(created_at) %>% 
  mutate(date = ymd(as.Date(created_at))) %>% 
  group_by(date) %>% 
  summarise(n = n(), .groups = "drop_last") %>%
  ggplot(aes(x=date, y = n)) + 
  geom_line(size = 1, color = my_colors[1]) +
  coord_cartesian(clip = 'off') +
  geom_vline(xintercept = as.Date('2021-10-20'), linetype="dotted", size = 1.5, color = "red") +
  geom_vline(xintercept = as.Date('2021-11-08'), linetype="dotted", size = 1.5, color = "darkblue") +
  my_theme + theme(axis.title.x = element_blank()) +
  scale_x_date(date_breaks = "1 day") + 
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 )) +
  labs(title = "Number of COVID-19 Tweets shared between 10th Oct - 15th Nov", subtitle = "Number of tweets spiked on key events") +
    geom_label(aes(x=as.Date('2021-10-19'), y = 380, label = "PM Lee's address on COVID-19"), color = "red", size = 4, angle = 90, fontface = "bold") +
    geom_label(aes(x=as.Date('2021-11-07'), y = 380, label = "More Opening on COVID-19 restrictions"  ), color = "darkblue", size = 4, angle = 90, fontface = "bold") 

Data Source 2: Tidy & Transform

# Step 1: Tokenization ----
sg_tweets_id <- sg_tweets %>% 
  mutate(created_at = as.Date(sg_tweets$created_at)) %>% 
  rowid_to_column("id")

tidy_tweets_token <- sg_tweets_id %>%
  drop_na(text) %>% 
  select(id, created_at, status_id, text) %>% 
  filter(text != "") %>% 
  unnest_tokens(word, text, token = "tweets") 

# Step 2: Cleaning ----
tweets_cleaned <- tidy_tweets_token %>%
  anti_join(tidytext::stop_words)

# Manual cleaning, filtering out unwanted words
tweets_token_cleaned <- tweets_cleaned %>%
  filter(!word %in% c("singapore", "covid", "covid19","positive","negative","oct","nov","news","amp","reuters","news","daily","malaysia","november","october","october","press","journal","amid","weekly","days","weeks","china","chinese","report","am","pm","dont","taking","found","morning","bloomberg","months","month","india","media","week","read","reports","data","europe","monday","tuesday","wednesday","thursday","friday","satursday","sunday","wall","street") & !str_detect(word,"^#|^@") & !str_detect(word, "^[:digit:]+$"))

Visualisation for Basic Exploratory Data Analysis

A Simple Word Cloud

covid_tweets_for_wc <- tweets_token_cleaned %>% 
  group_by(word) %>% 
  summarise(frequency = n()) %>% 
  arrange(desc(frequency))

covid_tweets_for_wc %>% 
  filter(frequency > 3) %>% 
  wordcloud2(backgroundColor = "black", 
             color = "random-light")

Word Cloud (Positive vs Negative)

# A Postive-Negative Word Cloud by using BING
BING <- get_sentiments("bing")

tweets_token_cleaned %>% 
  inner_join(BING, by="word") %>%
  count(word, sentiment, sort=T) %>% 
  acast(word ~ sentiment, value.var = "n", fill=0) %>% 
  comparison.cloud(colors=my_colors[c(5, 1)], max.words = 400, title.size = 2,
                   scale = c(3,.5))

Top 3 Most Negative Tweets in the dataset

AFINN <- get_sentiments("afinn")

## TOP 3 MOST NEGATIVE TWEET ----
tweets_AFINN_indexed <- tweets_token_cleaned %>% 
  inner_join(AFINN, by = "word")

tweet_level_sentiment <- tweets_AFINN_indexed %>% 
  group_by(id) %>% 
  summarise(average_sentiment = mean(value),
            n_of_words_indexed = n()
  )

top3_negative <- tweet_level_sentiment %>% 
  arrange(average_sentiment) %>% 
  head(3) 

sg_tweets_id %>% 
  filter(id %in% top3_negative$id ) %>% 
  select(text) %>% 
  pull()
## [1] "'They don't choose to have it': Why Covid-19 is hell for people with OCD"                                                                                                      
## [2] "Asked my mum to get me donuts 🍩 I’ve no idea why i did that when I can’t taste no shit. Arghh but just cravings 😫 fuck covid 19, fuck the monthly cycle, fuck everything! 😖"
## [3] "deadass my eyes went from covid 19 then positive bitch"

Top 3 Most Positive Tweets in the dataset

# TOP 3 MOST POSITIVE TWEETS ----
top3_positive <- tweet_level_sentiment %>% 
  arrange(desc(average_sentiment)) %>% 
  head(3)

sg_tweets_id %>% 
  filter(id %in% top3_positive$id) %>% 
  select(text) %>% 
  pull()
## [1] "…or else they find alternative employment. They are also bringing in a vax passport to enter some venues buildings etc, unless there is a current Covid-19 test (often valid only for a day or 2). It's basically becoming a no vax, no job, &amp; no fun situation. This I approve."                                                
## [2] "“MOH said the doctor had no known medical conditions and was partially vaccinated with a non-mRNA Covid-19 vaccine under the Special Access Route.”\n\nWow, this has to be stated so explicitly. I wonder why? \n\n⁦⁩ ⁦@Huigoon⁩"                                                                                                          
## [3] "@DavidBieleski @bergeron_brent @katiehasedits This trend was evident on 9Aug 2020, Back then out of the 3, SG had the most cases with 55,104, whilst NZ, the least with 1,219. Even back then Greece the highest with 213 COVID-19 related deaths.\n\nFast forward to 5 Nov 2021, Greece is the winner in Covid-19 cases &amp; deaths."

Overall Emotion Analysis

Distribution Breakdown by Emotion Class using NRC technique

NRC <- get_sentiments("nrc")

options(repr.plot.width=15, repr.plot.height=9)

tweets_token_cleaned %>% 
  inner_join(NRC, "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>% 
  count(sentiment, sort=T) %>% 
  ggplot(aes(x=reorder(sentiment, n), y=n)) +
  geom_bar(stat="identity", aes(fill=n), show.legend=F) +
  geom_label(aes(label=format(n, big.mark = ",")), size=5, fill="white") +
  labs(x="Sentiment", y="Frequency", title="What is the overall mood in Tweets?") +
  scale_fill_gradient(low = my_colors[3], high = my_colors[1], guide="none") +
  coord_flip() + 
  my_theme + theme(axis.text.x = element_blank())

Most Fequent Words by Emotion Class

#options(repr.plot.width=25, repr.plot.height=9)

tweets_token_cleaned %>% 
  inner_join(NRC, "word") %>% 
  count(sentiment, word, sort=T) %>%
  group_by(sentiment) %>% 
  arrange(desc(n)) %>% 
  slice(1:7) %>% 
  ggplot(aes(x=reorder(word, n), y=n)) +
  geom_col(aes(fill=sentiment), show.legend = F) +
  facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 5) +
  coord_flip() +
  my_theme + theme(axis.text.x = element_blank()) +
  labs(x="Word", y="Frequency", title="Sentiment split by most frequent words") +
  scale_fill_manual(values = c(my_colors, "#BE82AF", "#9D4387", "#DEC0D7",
                               "#40BDC8", "#80D3DB", "#BFE9ED"))

Research on Influence from Singapore PM’s address

20 Oct 2021

Here, we are interested in a research question: We have been taking note of key events announce by the government and we are keen to know if the announcement from the leadership on 20 Oct will affect or change public sentiment.

  1. PM’s address the nation on COVID-19 situation in Singapore:
  • The path to a “New Normal”, diverted from original zero COVID approach and to live with COVID19.
  • local cases spiked sharply over the past few weeks
  • Asked for unity and COVID resilience.
  1. Announcement on COVID social curbs to be extended another month to 21 Nov. Originally slated to be in place until 24 Oct.
  • Dining out to 2 people
  • Work from home remains the default.

We are going to use Regression Discontinuity Analysis on the causal inference and effect.

Firstly, we explore the data with 10 days before and after PM Lee’s address, assuming date close to the cut off on 20 Oct has more relevant effects.

Data Overview

Using AFINN technique

sentiment_daily <- tweets_AFINN_indexed %>% 
  group_by(created_at) %>% 
  summarise(average_sentiment = mean(value),
            n_of_words_indexed = n()) 

  # Plot
sentiment_daily %>% 
  filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) %>% 
  ggplot(aes(x = created_at, y = average_sentiment) ) +
  geom_point(size = 2, color = my_colors[1]) +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1, linetype="dotdash", color = my_colors[6]) +
  scale_x_date(date_breaks = "1 day") + 
  ggtitle("Distribution of Average Sentiment \n10 days before & after PM address") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 ))

Using NRC technique for Emotion Classification

# Extract for analysis period
 tweets_token_analysis_period <- tweets_token_cleaned %>% 
  filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) 

classified_sentiment <- tweets_token_analysis_period %>% 
  inner_join(NRC, "word") %>% 
  group_by(sentiment, created_at) %>% 
  summarise(cnt = n()) 

# Plot Chart
classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>% 
  ggplot(aes(x=created_at, y =cnt, color = sentiment)) +
  geom_point() +
  facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 4) +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8]) +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d") +
    theme(axis.text.x = element_text(angle = 45, size = rel(0.8), vjust = 1, hjust = 1 )) +
  labs(y="Count of Emotional Words", x="Period of Date")

Using Radar Chart, another visualisation chart.

# Data transformation
char_sentiment <- classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
  group_by(sentiment, covid_address_effect) %>% 
  summarise(char_sentiment_count = sum(cnt)) 

total_char <-   classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
  group_by(covid_address_effect) %>% 
  summarise(total = sum(cnt))

# Plot Chart
char_sentiment %>% 
  inner_join(total_char, by="covid_address_effect") %>% 
  mutate(percent = char_sentiment_count / total * 100 ) %>% 
  select(-char_sentiment_count, -total) %>% 
  spread(covid_address_effect, percent)  %>% 
  radarchart::chartJSRadar(showToolTipLabel = T, main="Any Effects on the Emotion Class Percentage After Address? - No", maxScale=25, responsive=T,addDots = T, colMatrix = grDevices::col2rgb(my_colors[c(2,4)]),lineAlpha = 0.7, polyAlpha =0.2)

Simple Linear Regression

OLS Linear Regression on average sentiment over time period

merged_dataset_RDD <- SSA1 %>% 
  inner_join(sentiment_daily, by = c("Date" = "created_at")) %>% 
  filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-10-30')) 

# add dummy variable for pre-effect = 0, and post-effect = 1
merged_dataset_RDD <- merged_dataset_RDD %>% 
  mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))

merged_dataset_RDD %>% 
  lm(average_sentiment ~ covid_address_effect + I(Date - as.Date('2021-10-20')),.) %>% 
  summary()
## 
## Call:
## lm(formula = average_sentiment ~ covid_address_effect + I(Date - 
##     as.Date("2021-10-20")), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36720 -0.09456  0.01933  0.08557  0.33828 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.336808   0.082404  -4.087 0.000691 ***
## covid_address_effectBefore      -0.240103   0.153254  -1.567 0.134596    
## I(Date - as.Date("2021-10-20")) -0.007816   0.012640  -0.618 0.544075    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1754 on 18 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1164 
## F-statistic: 2.318 on 2 and 18 DF,  p-value: 0.1271
merged_dataset_RDD %>% 
  ggplot(aes(x = Date, y = average_sentiment)) +
  geom_point(aes(color = covid_address_effect)) + 
  geom_smooth(method = "lm") +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  ggthemes::theme_fivethirtyeight() +
  ylab("Average Sentiment") +
  theme(axis.title.y = element_text(), legend.position = "bottom") +
  labs(title="OLS Simple Regression Model")

Regression Discontinuity Analysis

We perform Regression Discontinuity Analysis on the effects of PM address event.

We expect to observe there is a high volume on Oct. 20, jump in sentiment score/count on Oct. 19 and Oct. 21

# RDD
RDD <- merged_dataset_RDD %>% 
  ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("Regression Discontinuity Analysis") +
  ylab("Average Sentiment") +
  theme(axis.title.y = element_text()) +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])

# Difference in Means Test
RDD_box <- merged_dataset_RDD %>% 
  ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
  geom_boxplot(outlier.colour="black",
               outlier.size=2, notch=FALSE) + 
  geom_point() +
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("Test for Significant Difference") +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])

gridExtra::grid.arrange(RDD, RDD_box, ncol=2)

Perform T-test to find significance in difference between 2 groups (Before and After PM address)

# Conduct a difference of means test
# Hypothesis: H0 : mean of pre-address_effect = mean of post-address_effect
merged_dataset_RDD %>%
  t.test(average_sentiment ~ covid_address_effect, .)
## 
##  Welch Two Sample t-test
## 
## data:  average_sentiment by covid_address_effect
## t = 2.1322, df = 17.963, p-value = 0.04705
## alternative hypothesis: true difference in means between group After and group Before is not equal to 0
## 95 percent confidence interval:
##  0.002298216 0.313764096
## sample estimates:
##  mean in group After mean in group Before 
##           -0.3758900           -0.5339212

Model

For the preparation of the model, we created and ran a correlational matrix, to see how our variables of interest (within the model) are related.

pacman::p_load(Hmisc, broom, DT)

merged_dataset <- SSA1 %>% 
  inner_join(sentiment_daily, by = c("Date" = "created_at")) %>% 
  filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-11-12')) %>% 
  mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))

merged_dataset$Date <- as.Date(merged_dataset$Date)

merged_dataset <- merged_dataset[ , -(7:8)]
#Getting Summary
merged_dataset %>% 
  select("average_sentiment", "Daily.Confirmed", "Daily.Deaths","Still.Hospitalised", "Intensive.Care.Unit..ICU.") %>% 
  summary(.)
##  average_sentiment Daily.Confirmed  Daily.Deaths   Still.Hospitalised
##  Min.   :-0.7955   Min.   :1767    Min.   : 6.00   Min.   :1434      
##  1st Qu.:-0.5233   1st Qu.:2943    1st Qu.: 9.00   1st Qu.:1584      
##  Median :-0.4250   Median :3182    Median :12.00   Median :1640      
##  Mean   :-0.4001   Mean   :3206    Mean   :12.03   Mean   :1633      
##  3rd Qu.:-0.2721   3rd Qu.:3472    3rd Qu.:14.75   3rd Qu.:1686      
##  Max.   : 0.1036   Max.   :5324    Max.   :18.00   Max.   :1757      
##  Intensive.Care.Unit..ICU.
##  Min.   :41.00            
##  1st Qu.:58.25            
##  Median :64.00            
##  Mean   :61.88            
##  3rd Qu.:68.50            
##  Max.   :75.00
merged_dataset %>% 
  select(average_sentiment, Daily.Confirmed, Daily.Deaths, Still.Hospitalised,Intensive.Care.Unit..ICU.) %>%
  as.matrix(.) %>% 
  rcorr(.) %>% 
  tidy(.) %>% 
  rename(variable_1 = column1,
         variable_2 = column2,
         corr = estimate) %>% 
  mutate(abs_corr = abs(corr)
  )
variable_1variable_2corrnp.valueabs_corr
Daily.Confirmedaverage_sentiment-0.165 340.351 0.165 
Daily.Deathsaverage_sentiment0.115 340.518 0.115 
Daily.DeathsDaily.Confirmed0.0864340.627 0.0864
Still.Hospitalisedaverage_sentiment0.329 340.05730.329 
Still.HospitalisedDaily.Confirmed0.152 340.39  0.152 
Still.HospitalisedDaily.Deaths0.0675340.704 0.0675
Intensive.Care.Unit..ICU.average_sentiment0.307 340.07730.307 
Intensive.Care.Unit..ICU.Daily.Confirmed0.103 340.56  0.103 
Intensive.Care.Unit..ICU.Daily.Deaths0.168 340.342 0.168 
Intensive.Care.Unit..ICU.Still.Hospitalised0.415 340.01450.415 

It was observed that Intensive.Care Unit and Still Hospitalized has the highest correlation from the result.

We will use Merged Dataset for predicting average sentiment based on daily ICU, Confirmed, Hospitalized and Daily Deaths cases. We’ll randomly split the data into training set (70% for building a predictive model) and test set (30% for evaluating the model). Make sure to set seed for reproducibility.

# Split the data into training and test set
set.seed(1234)
training.samples <- merged_dataset$average_sentiment %>%
  createDataPartition(p = 0.7, list = FALSE)
train.data  <- merged_dataset[training.samples, ]
test.data <- merged_dataset[-training.samples, ]


#Note that, if you have many predictor variables in your data, you can simply include all the available variables in the model using ~.:
model <- lm(average_sentiment ~., data = train.data)

# Summarize the model
summary(model)
## 
## Call:
## lm(formula = average_sentiment ~ ., data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24035 -0.13448 -0.01832  0.07180  0.43221 
## 
## Coefficients:
##                               Estimate   Std. Error t value Pr(>|t|)
## (Intercept)               -93.42312312 126.42061809  -0.739    0.469
## Date                        0.00486328   0.00669085   0.727    0.476
## Daily.Confirmed            -0.00007253   0.00008996  -0.806    0.430
## Still.Hospitalised          0.00060785   0.00058399   1.041    0.310
## Daily.Deaths               -0.00254341   0.01248686  -0.204    0.841
## Intensive.Care.Unit..ICU.   0.00369593   0.00707699   0.522    0.607
## 
## Residual standard error: 0.2008 on 20 degrees of freedom
## Multiple R-squared:  0.2217, Adjusted R-squared:  0.02712 
## F-statistic: 1.139 on 5 and 20 DF,  p-value: 0.3723
# Make predictions
predictions <- model %>% predict(test.data)
RMSE(predictions, test.data$average_sentiment)
## [1] 0.2003003
# (b) R-square
R2(predictions, test.data$average_sentiment)
## [1] 0.2647791

Specify OLS model

We ran two regression models. The first regressed daily cases onto average sentiment (model1).

\[ \begin{eqnarray} \widehat{average.sentiment} = intercept + b_1Daily.Deaths + b_2Daily.Confirmed + b_3Still.Hospitalised + b_4Intensive.Care.Unit..ICU.+ \epsilon \end{eqnarray} \]

Our key investigation lies in the next model, in which we regressed daily cases orientations, along with interaction terms, onto average sentiment (model2).

\[ \begin{eqnarray} \widehat{average.sentiment} = intercept + b_1Daily.Deaths + b_2Daily.Confirmed + b_3Still.Hospitalised + b_4Daily.Deaths \times icu + b_5Daily.Confirmed \times icu + b_6Still.Hospitalised \times icu + \epsilon \end{eqnarray} \]

model1 <- lm(average_sentiment~  Daily.Deaths + Daily.Confirmed + Still.Hospitalised + Intensive.Care.Unit..ICU., 
             train.data)

tidy(model1) %>% as_tibble()
termestimatestd.errorstatisticp.value
(Intercept)-1.54    0.922   -1.67 0.11 
Daily.Deaths0.00132 0.0112  0.1180.907
Daily.Confirmed-9.34e-058.43e-05-1.11 0.28 
Still.Hospitalised0.0005770.0005761    0.328
Intensive.Care.Unit..ICU.0.00734 0.00493 1.49 0.151
glance(model1)
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.2010.0490.1991.320.29447.91-3.833.720.8282126
model2 <- lm(average_sentiment ~ (Daily.Confirmed + Daily.Deaths +Still.Hospitalised) * Intensive.Care.Unit..ICU., 
             train.data)

tidy(model2) %>% as_tibble()
termestimatestd.errorstatisticp.value
(Intercept)-8.41    7.57    -1.11 0.281 
Daily.Confirmed0.00157 0.0006252.52 0.0215
Daily.Deaths-0.3     0.115   -2.62 0.0175
Still.Hospitalised0.00384 0.00454 0.8450.409 
Intensive.Care.Unit..ICU.0.1     0.121   0.8310.417 
Daily.Confirmed:Intensive.Care.Unit..ICU.-2.61e-059.78e-06-2.67 0.0155
Daily.Deaths:Intensive.Care.Unit..ICU.0.0046  0.00174 2.64 0.0166
Still.Hospitalised:Intensive.Care.Unit..ICU.-4.13e-057.33e-05-0.5630.581 
glance(model2)
r.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residualnobs
0.4950.2990.1712.520.0539713.9-9.761.560.5231826
summary(model1)
## 
## Call:
## lm(formula = average_sentiment ~ Daily.Deaths + Daily.Confirmed + 
##     Still.Hospitalised + Intensive.Care.Unit..ICU., data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24157 -0.12768 -0.01705  0.05730  0.46390 
## 
## Coefficients:
##                              Estimate  Std. Error t value Pr(>|t|)
## (Intercept)               -1.53616590  0.92152726  -1.667    0.110
## Daily.Deaths               0.00131739  0.01117317   0.118    0.907
## Daily.Confirmed           -0.00009339  0.00008429  -1.108    0.280
## Still.Hospitalised         0.00057723  0.00057589   1.002    0.328
## Intensive.Care.Unit..ICU.  0.00734444  0.00493239   1.489    0.151
## 
## Residual standard error: 0.1986 on 21 degrees of freedom
## Multiple R-squared:  0.2011, Adjusted R-squared:  0.04898 
## F-statistic: 1.322 on 4 and 21 DF,  p-value: 0.2944
summary(model2)
## 
## Call:
## lm(formula = average_sentiment ~ (Daily.Confirmed + Daily.Deaths + 
##     Still.Hospitalised) * Intensive.Care.Unit..ICU., data = train.data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24710 -0.10603 -0.00932  0.09528  0.34621 
## 
## Coefficients:
##                                                  Estimate   Std. Error t value
## (Intercept)                                  -8.410352619  7.566961238  -1.111
## Daily.Confirmed                               0.001574364  0.000625087   2.519
## Daily.Deaths                                 -0.300146413  0.114706230  -2.617
## Still.Hospitalised                            0.003840899  0.004544835   0.845
## Intensive.Care.Unit..ICU.                     0.100410137  0.120896846   0.831
## Daily.Confirmed:Intensive.Care.Unit..ICU.    -0.000026150  0.000009783  -2.673
## Daily.Deaths:Intensive.Care.Unit..ICU.        0.004601271  0.001742310   2.641
## Still.Hospitalised:Intensive.Care.Unit..ICU. -0.000041252  0.000073300  -0.563
##                                              Pr(>|t|)  
## (Intercept)                                    0.2810  
## Daily.Confirmed                                0.0215 *
## Daily.Deaths                                   0.0175 *
## Still.Hospitalised                             0.4091  
## Intensive.Care.Unit..ICU.                      0.4171  
## Daily.Confirmed:Intensive.Care.Unit..ICU.      0.0155 *
## Daily.Deaths:Intensive.Care.Unit..ICU.         0.0166 *
## Still.Hospitalised:Intensive.Care.Unit..ICU.   0.5805  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1705 on 18 degrees of freedom
## Multiple R-squared:  0.4952, Adjusted R-squared:  0.2989 
## F-statistic: 2.522 on 7 and 18 DF,  p-value: 0.05385

We tested if model2, with interaction terms, enhances the explanatory power of the model using anova function.

anova(model1, model2)
Res.DfRSSDfSum of SqFPr(>F)
210.828            
180.52330.3053.490.0371

The results of the analysis suggest that adding the interaction terms significantly increases the R-squared of model2, as compared to model1.

Assumption Check

library(gvlma)
gvlma(model2)
## 
## Call:
## lm(formula = average_sentiment ~ (Daily.Confirmed + Daily.Deaths + 
##     Still.Hospitalised) * Intensive.Care.Unit..ICU., data = train.data)
## 
## Coefficients:
##                                  (Intercept)  
##                                  -8.41035262  
##                              Daily.Confirmed  
##                                   0.00157436  
##                                 Daily.Deaths  
##                                  -0.30014641  
##                           Still.Hospitalised  
##                                   0.00384090  
##                    Intensive.Care.Unit..ICU.  
##                                   0.10041014  
##    Daily.Confirmed:Intensive.Care.Unit..ICU.  
##                                  -0.00002615  
##       Daily.Deaths:Intensive.Care.Unit..ICU.  
##                                   0.00460127  
## Still.Hospitalised:Intensive.Care.Unit..ICU.  
##                                  -0.00004125  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model2) 
## 
##                      Value p-value                Decision
## Global Stat        3.20136  0.5247 Assumptions acceptable.
## Skewness           0.70805  0.4001 Assumptions acceptable.
## Kurtosis           0.14447  0.7039 Assumptions acceptable.
## Link Function      2.32459  0.1273 Assumptions acceptable.
## Heteroscedasticity 0.02425  0.8763 Assumptions acceptable.
library(ggthemes)
theme_set(theme_fivethirtyeight())

library(ggfortify)
autoplot(gvlma(model2))

Check Multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
vif(model1);vif(model2)
##              Daily.Deaths           Daily.Confirmed        Still.Hospitalised 
##                  1.075882                  1.087332                  1.124124 
## Intensive.Care.Unit..ICU. 
##                  1.195660
##                              Daily.Confirmed 
##                                     81.10537 
##                                 Daily.Deaths 
##                                    153.80602 
##                           Still.Hospitalised 
##                                     94.96315 
##                    Intensive.Care.Unit..ICU. 
##                                    974.33724 
##    Daily.Confirmed:Intensive.Care.Unit..ICU. 
##                                    164.60210 
##       Daily.Deaths:Intensive.Care.Unit..ICU. 
##                                    196.10806 
## Still.Hospitalised:Intensive.Care.Unit..ICU. 
##                                   1215.39954

Report the Results with kable in R Markdown

library(knitr) # Please install the package "knitr" first.
library(kableExtra) # You might want to use package "kableExtra" as well.
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:huxtable':
## 
##     add_footnote
## The following object is masked from 'package:dplyr':
## 
##     group_rows
kable(tidy(model2))%>%
  kable_styling("striped", full_width = T, fixed_thead = T) %>%
  column_spec(c(1, 5), bold = T) %>%
  row_spec(c(2, 4, 6,8), bold = T, color = "white", background = "#ff6347")
term estimate std.error statistic p.value
(Intercept) -8.4103526 7.5669612 -1.1114571 0.2809975
Daily.Confirmed 0.0015744 0.0006251 2.5186300 0.0214597
Daily.Deaths -0.3001464 0.1147062 -2.6166531 0.0174761
Still.Hospitalised 0.0038409 0.0045448 0.8451129 0.4091412
Intensive.Care.Unit..ICU. 0.1004101 0.1208968 0.8305439 0.4171120
Daily.Confirmed:Intensive.Care.Unit..ICU. -0.0000261 0.0000098 -2.6729591 0.0155165
Daily.Deaths:Intensive.Care.Unit..ICU. 0.0046013 0.0017423 2.6409030 0.0166049
Still.Hospitalised:Intensive.Care.Unit..ICU. -0.0000413 0.0000733 -0.5627803 0.5805272
kable(glance(model2))%>%
  kable_styling("striped", full_width = T, font_size = 12) %>%
  column_spec(c(2,4), bold = T, color = "white", background = "#ff6347")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.4951796 0.2988606 0.1705067 2.522321 0.053852 7 13.88151 -9.763028 1.55984 0.5233058 18 26

The regression analysis came up with two significant interaction terms.

  • First, it appears that the relationships between Daily Intensive Care Unit(ICU) cases regarding average sentiment is different depending on daily confirmed cases.

  • Second, it appears that the relationships between Daily Intensive Care Unit(ICU) cases regarding average sentiment is different depending on daily deaths cases.

Visualize

To visualize the OLS regression analysis performed above, we stored the OLS regression model’s predictions.

Daily death is a discrete and countable variable in our model. To make it easier for interpretation, we categorized them into three levels (mean below 1SD, mean, and mean above 1SD). The following figure represents the three lines that represent differing Daily Deaths set at M-1SD, Mean, M+1SD, as noted above, and how differing Daily Deaths make differences to relationships between Intensive Care Cases and Average Sentiment.

#Daily death
merged_dataset %>% summarise(sd(Daily.Deaths), mean(Daily.Deaths)) %>% 
  mutate(`sd(Daily.Deaths)` = round(`sd(Daily.Deaths)`, digits = 0),
         `mean(Daily.Deaths)` = round(`mean(Daily.Deaths)`, digits = 0))
sd(Daily.Deaths)mean(Daily.Deaths)
312
grid_group3 <- merged_dataset %>% 
  data_grid(Intensive.Care.Unit..ICU. , Daily.Deaths = c(9,  12, 15), # c(-1.26, 0.00, 1.26),
            Still.Hospitalised =0, Daily.Confirmed = 0) %>% 
  add_predictions(model2)

grid <- grid_group3 %>% 
  mutate(Intensive.Care.Unit..ICU. = Intensive.Care.Unit..ICU. + mean(merged_dataset$Intensive.Care.Unit..ICU.), Daily.Deaths = factor(as.double(factor(Daily.Deaths))))


ggplot(grid, aes(x = Intensive.Care.Unit..ICU., y = pred, color = factor(Daily.Deaths))) +
  geom_line(size = 1) +
  scale_color_discrete(breaks = c(1, 2, 3), 
                       label=c("Low in Death", 
                               "Mean Death", 
                               "High in Death")) +
  labs(x = "Intensive Care Unit", 
       y = "Average Sentiment",
       color = "Daily Death") +
  coord_cartesian(ylim = c(-4, -0.8), xlim = c(100, 138)) +
  theme_linedraw() +
  theme(legend.position= "top")

The following figure represents the three lines that represent differing Daily Cases set at M-1SD, Mean, M+1SD. It will show how differing Daily Cases make differences to relationships between Intensive Care Cases and Average Sentiment.

#Daily confirm

merged_dataset %>% summarise(sd(Daily.Confirmed), mean(Daily.Confirmed)) %>% 
    mutate(`sd(Daily.Confirmed)` = round(`sd(Daily.Confirmed)`, digits = 0),
         `mean(Daily.Confirmed)` = format(`mean(Daily.Confirmed)`, digit=4, scientific=FALSE)) 
sd(Daily.Confirmed)mean(Daily.Confirmed)
6403206
grid_group4 <- merged_dataset %>% 
  data_grid(Intensive.Care.Unit..ICU. , Daily.Confirmed = c(2566, 3206, 3846),
            Still.Hospitalised =0, Daily.Deaths = 0) %>% 
  add_predictions(model2)


grid1 <- grid_group4 %>% 
  mutate(Intensive.Care.Unit..ICU. = Intensive.Care.Unit..ICU. + mean(merged_dataset$Intensive.Care.Unit..ICU.), Daily.Confirmed = factor(as.double(factor(Daily.Confirmed))))


ggplot(grid1, aes(x = Intensive.Care.Unit..ICU., y = pred, color = factor(Daily.Confirmed))) +
  geom_line(size = 0.7) +
  scale_color_discrete(breaks = c(1,2,3), 
                       label=c("Low in Daily Confirmed", 
                               "Mean Daily Confirmed", 
                               "High in Daily Confirmed")) +
  labs(x = "Intensive Care Unit", 
       y = "Average Sentiment",
       color = "Daily Confirmed") +
  coord_cartesian(ylim = c(-3.5, -1.5), xlim = c(100, 140)) +
  theme_linedraw() +
  theme(legend.position= "top")

pacman::p_load(jtools, huxtable, ggstance, interactions)

m1 <- lm(average_sentiment~  Daily.Deaths + Daily.Confirmed +Still.Hospitalised + Intensive.Care.Unit..ICU., 
         train.data)

m2 <- lm(average_sentiment ~ (Daily.Confirmed + Daily.Deaths +Still.Hospitalised) * Intensive.Care.Unit..ICU., 
         train.data)

export_summs(m1, m2, 
             error_format = "(t = {statistic}, p = {p.value})",
             align = "right",
             model.names = c("Main Effects Only", "with Interactions"),
             digits = 3)
Main Effects Onlywith Interactions
(Intercept)-1.536-8.410
(t = -1.667, p = 0.110)(t = -1.111, p = 0.281)
Daily.Deaths0.001-0.300 *
(t = 0.118, p = 0.907)(t = -2.617, p = 0.017)
Daily.Confirmed-0.0000.002 *
(t = -1.108, p = 0.280)(t = 2.519, p = 0.021)
Still.Hospitalised0.0010.004
(t = 1.002, p = 0.328)(t = 0.845, p = 0.409)
Intensive.Care.Unit..ICU.0.0070.100
(t = 1.489, p = 0.151)(t = 0.831, p = 0.417)
Daily.Confirmed:Intensive.Care.Unit..ICU.-0.000 *
(t = -2.673, p = 0.016)
Daily.Deaths:Intensive.Care.Unit..ICU.0.005 *
(t = 2.641, p = 0.017)
Still.Hospitalised:Intensive.Care.Unit..ICU.-0.000
(t = -0.563, p = 0.581)
N2626
R20.2010.495
*** p < 0.001; ** p < 0.01; * p < 0.05.
plot_summs(m1, m2, 
           scale = T,
           plot.distributions = T,
           model.names = c("Main Effects Only", "with Interactions")) +
  theme(legend.position = "top")

sim_slopes(m2,
           pred = Intensive.Care.Unit..ICU., 
           modx = Daily.Deaths,
           johnson_neyman = F)
## SIMPLE SLOPES ANALYSIS 
## 
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths =  8.38994 (- 1 SD): 
## 
##    Est.   S.E.   t val.      p
## ------- ------ -------- ------
##   -0.01   0.01    -1.54   0.14
## 
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths = 12.07692 (Mean): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.00   0.01     0.83   0.42
## 
## Slope of Intensive.Care.Unit..ICU. when Daily.Deaths = 15.76391 (+ 1 SD): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.02   0.01     2.48   0.02
sim_slopes(m2,
           pred = Daily.Deaths, 
           modx = Intensive.Care.Unit..ICU.,
           johnson_neyman = T)
## JOHNSON-NEYMAN INTERVAL 
## 
## When Intensive.Care.Unit..ICU. is OUTSIDE the interval [57.22, 71.98], the
## slope of Daily.Deaths is p < .05.
## 
## Note: The range of observed values of Intensive.Care.Unit..ICU. is [41.00,
## 75.00]
## 
## SIMPLE SLOPES ANALYSIS 
## 
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 54.00306 (- 1 SD): 
## 
##    Est.   S.E.   t val.      p
## ------- ------ -------- ------
##   -0.05   0.02    -2.30   0.03
## 
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 62.80769 (Mean): 
## 
##    Est.   S.E.   t val.      p
## ------- ------ -------- ------
##   -0.01   0.01    -1.02   0.32
## 
## Slope of Daily.Deaths when Intensive.Care.Unit..ICU. = 71.61232 (+ 1 SD): 
## 
##   Est.   S.E.   t val.      p
## ------ ------ -------- ------
##   0.03   0.01     2.05   0.05

Interpretation of the Results

From our data science project, we could find the following three findings:

  1. The first case of COVID-19 found in Singapore was confirmed on 23 January 2020, and since nationwide partial lockdown or circuit breaker kicked in from 7 April 2020 to 1 June 2020, Singapore have been experiencing policies change from time to time in efforts to be more effectively cope with the surges. As new norm has been rooted in local lifestyle, no apparent jump or trend has been observed in public sentiment. In fact, we further analyse the effects from the most recent key announcements, a sudden increase of number of tweets and jump in overall sentiment appears that there is a causation effect, but it did not significantly drive any specific emotion class to form trending. Despite that social curb extension announced expected to cause negative sentiment, PM address has overwhelmed the effects on it and sentiment emerge with more trusts to the leadership. Thus, information delivering, create awareness and generate leads seems to be effective in producing positive public sentiment.

  2. The relationship between daily average sentiment and daily ICU cases depends on daily cases and daily death. It appears that if there is a high death with lower daily ICU cases, the average sentiment will be lower. However, if there is a higher ICU cases with high death rate, the average sentiment is higher. If there is a low death with low intensive care case, the average sentiment is higher and if there is a low death with high intensive care cases, the sentiment is lower.

  3. The relationship between daily cases and daily ICU is that be it low or high daily cases, the more intensive care cases, the better the average sentiment. The result is constant for both low or high daily cases.

Implications

There are two angles that we took to approach this. First is to analyse if there is any impact on the twitter feeds based on the major events announced by the government. From the two specific events that we noted, it was found that there is a sudden spikes in the number of tweets. Secondly we have analysed the correlation between daily average sentiment and the the number of Covid cases(death, ICU, hospitalized and daily cases). As shown in the result, we feel that we will need a longer time frame to analyse as the result shows that be it low or high daily cases, the result is constant but a higher death rate with low ICU cases will have a lower average sentiment.

Limitations and Future Directions

The public sentiment reflected and analysed by tweets data especially effective among younger age group who spend most of their time in social media and carefully tweets what is in their mind. This can serve only a small sample of a whole local population. If possible, we also need to filter out news publisher tweets and focus on individual tweets. For this analysis, we are comparing sentiment change within group based on causal inference driven by nation-wide speech and announcements. On how COVID-19 has changed the sentiment in the society, we have to identify and aware of timeline on key events happened and also view from a longer term from pre-covid and covid era, hence a relative longer time period of data will provide more insights for analysing the differences. We are hold back by the free Twitter developer account which eligible to extract tweets up to 7 days in the past.

Local slang words or Singlish are casually and frequently used among local community and most of them could be a good gauge of emotion, E.g. even a dialect vulgar should reflect anger, thus enriching the emotion dictionary like NRC can better tune to understand the positive and negative in sentiment.

Feature Importance describe which features are relevant and is another aspect we should explore to help us better understanding of solved problem and sometimes lead to model improvements for accuracy by employing feature selection.

In conclusion, we feel that if we can have a longer runway, we could yield better result. Probably for a 6 months period so that we can analyse more chain of events. Moreover we feel that covid has been there for almost 2 years, public has probably taken this as the new norm and hence a longer period of analysis will give us more data points to analyse.

References

[1] Julia Silge & David Robinson Text Mining with R - A TIDY APPROACH O’Reilly (2017)

[2] Andrea Cirillo R Data Mining - Implement data mining techniques through practical use cases and real-world datasets Packt> (2017)

[3] Tony Carilli R Companion to Real Econometrics February 2021 https://bookdown.org/carillitony/bailey/chp11.html

[4] Ashwin Malshe(2019-06-25) Data Analytics Applications https://ashgreat.github.io/analyticsAppBook/collect-tweets.html

[5]Hui Xiang Chua Covid-19 Singapore https://data.world/hxchua/covid-19-singapore

[6]Singapore Public Data COVID-19 case numbers https://data.gov.sg/dataset/covid-19-case-numbers

Appendix

Custom Stop-words for Text Pre-processing in Word Cloud data overview

“singapore”, “covid”, “covid19”,“positive”,“negative”,“oct”,“nov”,“news”,“amp”,“reuters”,“news”,“daily”, “malaysia”,“november”,“october”,“october”,“press”,“journal”,“amid”,“weekly”,“days”,“weeks”,“china”, “chinese”,“report”,“am”,“pm”,“dont”,“taking”,“found”,“morning”,“bloomberg”,“months”,“month”,“india”, “media”,“week”,“read”,“reports”,“data”,“europe”,“monday”,“tuesday”,“wednesday”,“thursday”,“friday”, “satursday”,“sunday”,“wall”,“street”

The objective is to clean those are less relevant and very little meaning to find sentiment, such as punctuation, special character, prefix with number, hashtag, alias, links and custom terms above.

We removed duplicated text in tweets, sent from the same screen name multiple times. For instance, there are several news publishers have posted the same tweet on different days.

sessionInfo()

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] interactions_1.1.5 ggstance_0.3.5     jtools_2.1.4       kableExtra_1.3.4  
##  [5] knitr_1.33         car_3.0-11         carData_3.0-4      ggfortify_0.4.12  
##  [9] ggthemes_4.2.4     gvlma_1.0.0.3      DT_0.19            caret_6.0-90      
## [13] glue_1.4.2         rtweet_0.7.0.9000  scales_1.1.1       sandwich_3.0-1    
## [17] Hmisc_4.5-0        Formula_1.2-4      survival_3.2-13    lattice_0.20-44   
## [21] psych_2.1.9        gridExtra_2.3      huxtable_5.4.0     textdata_0.4.1    
## [25] reshape2_1.4.4     wordcloud_2.6      RColorBrewer_1.1-2 wordcloud2_0.2.1  
## [29] tidytext_0.3.1     lubridate_1.7.10   modelr_0.1.8       broom_0.7.9       
## [33] forcats_0.5.1      stringr_1.4.0      dplyr_1.0.7        purrr_0.3.4       
## [37] readr_2.0.1        tidyr_1.1.4        tibble_3.1.5       ggplot2_3.3.5     
## [41] tidyverse_1.3.1   
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1         backports_1.2.1      systemfonts_1.0.2   
##   [4] plyr_1.8.6           splines_4.1.0        listenv_0.8.0       
##   [7] SnowballC_0.7.0      digest_0.6.28        foreach_1.5.1       
##  [10] htmltools_0.5.2      fansi_0.5.0          magrittr_2.0.1      
##  [13] checkmate_2.0.0      cluster_2.1.2        openxlsx_4.2.4      
##  [16] tzdb_0.1.2           recipes_0.1.17       globals_0.14.0      
##  [19] gower_0.2.2          svglite_2.0.0        jpeg_0.1-9          
##  [22] colorspace_2.0-2     rvest_1.0.1          rappdirs_0.3.3      
##  [25] haven_2.4.3          xfun_0.25            crayon_1.4.2        
##  [28] jsonlite_1.7.2       zoo_1.8-9            iterators_1.0.13    
##  [31] gtable_0.3.0         ipred_0.9-12         webshot_0.5.2       
##  [34] future.apply_1.8.1   abind_1.4-5          DBI_1.1.1           
##  [37] Rcpp_1.0.7           viridisLite_0.4.0    htmlTable_2.2.1     
##  [40] tmvnsim_1.0-2        foreign_0.8-81       radarchart_0.3.1    
##  [43] stats4_4.1.0         lava_1.6.10          prodlim_2019.11.13  
##  [46] htmlwidgets_1.5.4    httr_1.4.2           ellipsis_0.3.2      
##  [49] pkgconfig_2.0.3      farver_2.1.0         nnet_7.3-16         
##  [52] sass_0.4.0           dbplyr_2.1.1         utf8_1.2.2          
##  [55] tidyselect_1.1.1     labeling_0.4.2       rlang_0.4.12        
##  [58] munsell_0.5.0        cellranger_1.1.0     tools_4.1.0         
##  [61] cli_3.1.0            generics_0.1.1       pacman_0.5.1        
##  [64] evaluate_0.14        fastmap_1.1.0        yaml_2.2.1          
##  [67] ModelMetrics_1.2.2.2 fs_1.5.0             pander_0.6.4        
##  [70] zip_2.2.0            future_1.23.0        nlme_3.1-152        
##  [73] xml2_1.3.2           tokenizers_0.2.1     compiler_4.1.0      
##  [76] rstudioapi_0.13      curl_4.3.2           png_0.1-7           
##  [79] reprex_2.0.1         bslib_0.2.5.1        stringi_1.7.4       
##  [82] highr_0.9            Matrix_1.3-4         commonmark_1.7      
##  [85] vctrs_0.3.8          pillar_1.6.4         lifecycle_1.0.1     
##  [88] jquerylib_0.1.4      data.table_1.14.2    R6_2.5.1            
##  [91] latticeExtra_0.6-29  rio_0.5.27           janeaustenr_0.1.5   
##  [94] parallelly_1.29.0    codetools_0.2-18     MASS_7.3-54         
##  [97] assertthat_0.2.1     withr_2.4.2          mnormt_2.0.2        
## [100] broom.mixed_0.2.7    mgcv_1.8-36          parallel_4.1.0      
## [103] hms_1.1.1            grid_4.1.0           rpart_4.1-15        
## [106] timeDate_3043.102    class_7.3-19         rmarkdown_2.10      
## [109] pROC_1.18.0          base64enc_0.1-3

Contribution Statement

There are two main segments for this project which are covid cases and twitter mining. Both of us have been involved in these two parts and in fact we met up twice a week for the last month for process to explaining our work and difficulties and reviewing outcome on this project. Yun Xun main focus is on twitter mining and to analyse the sentiment based on regression discontinuity on special occasion whereas Alex is more focus on Covid cases and modeling. Having said that and as mentioned above, there is overlapping of effort from both of us as we have been trying to work seamlessly together to gel the all pieces of work into a complete project.